Estimating optimal window size for analysis of low-coverage next-generation sequence data
Authors
Abstract
MOTIVATION: High-throughput sequencing has transformed genome sequence analysis. In very low-coverage sequencing (<0.1×), 'binning' or 'windowing' the mapped short sequences ('reads') is critical for extracting the genomic information of interest for further evaluation, such as copy-number alteration analysis. If the window size is too small, many windows will have zero counts and almost no pattern can be observed. Conversely, if the window size is too large, the patterns or genomic features will be 'smoothed out'. Our objective is to identify an optimal window size between these two extremes.

RESULTS: We assume the read density to be a step function. Given this model, we propose a data-based estimate of the optimal window size based on Akaike's information criterion (AIC) and cross-validation (CV) log-likelihood. By plotting the AIC and CV log-likelihood curves as functions of window size, we can estimate the optimal window size as the one that minimizes AIC or maximizes CV log-likelihood. The proposed methods are general-purpose, and we illustrate their application using low-coverage next-generation sequence datasets from real tumour samples and simulated datasets.

AVAILABILITY AND IMPLEMENTATION: An R package to estimate optimal window size is available at http://www1.maths.leeds.ac.uk/~arief/R/win/.
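The selection idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' R package, and the abstract does not give their exact formulas. The sketch assumes the textbook histogram (step-function) density estimate, an AIC penalty of one parameter per window minus one for the normalization constraint, and leave-one-out cross-validation. All function names are hypothetical.

```python
import math


def bin_counts(positions, genome_length, window):
    """Return (counts, widths) for consecutive windows covering [0, genome_length)."""
    n_bins = math.ceil(genome_length / window)
    counts = [0] * n_bins
    for p in positions:
        # Guard the index so a read at the very end falls in the last window.
        counts[min(int(p // window), n_bins - 1)] += 1
    # The last window is shorter when window does not divide genome_length.
    widths = [min(window, genome_length - k * window) for k in range(n_bins)]
    return counts, widths


def histogram_aic(positions, genome_length, window):
    """AIC of the histogram density: -2 * log-likelihood + 2 * (windows - 1)."""
    counts, widths = bin_counts(positions, genome_length, window)
    n = len(positions)
    # Estimated density in a window is count / (n * width); empty windows add 0.
    loglik = sum(c * math.log(c / (n * w)) for c, w in zip(counts, widths) if c > 0)
    return -2.0 * loglik + 2.0 * (len(counts) - 1)


def loo_cv_loglik(positions, genome_length, window):
    """Leave-one-out CV log-likelihood of the histogram density."""
    counts, widths = bin_counts(positions, genome_length, window)
    n = len(positions)
    total = 0.0
    for c, w in zip(counts, widths):
        if c == 1:
            # A lone read gets density 0 once it is left out of its own window.
            return -math.inf
        if c > 1:
            total += c * math.log((c - 1) / ((n - 1) * w))
    return total


def select_window(positions, genome_length, candidates):
    """Return (window minimizing AIC, window maximizing CV log-likelihood)."""
    by_aic = min(candidates, key=lambda w: histogram_aic(positions, genome_length, w))
    by_cv = max(candidates, key=lambda w: loo_cv_loglik(positions, genome_length, w))
    return by_aic, by_cv


if __name__ == "__main__":
    # Toy step-function data (an assumption for illustration, not real reads):
    # 100 evenly spaced reads on [0, 50) and 10 on [50, 100).
    dense = [0.25 + 0.5 * i for i in range(100)]
    sparse = [52.5 + 5.0 * i for i in range(10)]
    print(select_window(dense + sparse, 100, [100, 50, 25]))  # (50, 50)
```

On this toy data both criteria pick the window of 50, which matches the single step: a window of 100 smooths the step away, and a window of 25 adds parameters without improving the fit. In this leave-one-out form, any window size that leaves a read alone in its window scores negative infinity, so very small windows are rejected outright.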
Related articles
Optimizing Window Size and its Sunshade in Four Main Directions of Residential Buildings in Mild Climate by Integrating Thermal and Lighting Analysis
As part of sustainable architecture principles and practices, designers need to define a building's architectural requirements based on climatic conditions, environmental preservation and reduction in energy consumption. Natural energy sources such as solar radiation affect the thermal and lighting performance of buildings depending on their facade characteristics. Traditionally, buildings thermal...
Estimating Sequence Similarity from Read Sets for Clustering Next-Generation Sequencing data
To cluster sequences given only their read-set representations, one may try to reconstruct each one from the corresponding read set, and then employ conventional (dis)similarity measures such as the edit distance on the assembled sequences. This approach is however problematic and we propose instead to estimate the similarities directly from the read sets. Our approach is based on an adaptation...
Estimation of metagenome size and structure in an experimental soil microbiota from low coverage next-generation sequence data.
AIMS A major challenge in metagenome studies is to estimate the true size of all combined genomes. Here, we present a novel approach to estimate the size of all combined genomes for low coverage next-generation sequencing (NGS) data through empirically determined copy numbers of random DNA fragments. METHODS AND RESULTS Size estimates were made based on analyses of two experimental soil micro...
Performance of common analysis methods for detecting low-frequency single nucleotide variants in targeted next-generation sequence data.
Next-generation sequencing (NGS) is becoming a common approach for clinical testing of oncology specimens for mutations in cancer genes. Unlike inherited variants, cancer mutations may occur at low frequencies because of contamination from normal cells or tumor heterogeneity and can therefore be challenging to detect using common NGS analysis tools, which are often designed for constitutional g...
Strategies and Clinical Applications of Next Generation Sequencing
DNA sequencing is one of the most valuable techniques in molecular biology and can be used to determine the sequence of nucleotides in a DNA fragment. High-throughput sequencing, known as Next Generation Sequencing (NGS), has revolutionized genomic research and molecular biology; the whole human genome can now be sequenced at low cost in several days. NGS technology is simi...
Journal title:
- Bioinformatics
Volume 30, Issue 13
Pages -
Publication date: 2014